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Abstract 

We propose a multiresolution Gaussian process to capture long-range, non-Markovian 
dependencies while allowing for abrupt changes. The multiresolution GP hierarchically 
couples a collection of smooth GPs, each defined over an element of a random nested 
partition. Long-range dependencies are captured by the top-level GP while the partition 
points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can 
analytically marginalize the GPs and compute the conditional likelihood of the observations 
given the partition tree. This property allows for efficient inference of the partition itself, 
for which we employ graph-theoretic techniques. We apply the multiresolution GP to the 
analysis of Magnetoencephalography (MEG) recordings of brain activity. 



1 Introduction 

A key challenge in many time series applications is capturing long-range dependencies for which 
Markov-based models are insufficient. One method of addressing this challenge is through em- 
ploying a Gaussian process (GP) with an appropriate (non-band-limited) covariance function. 
However, GPs typically assume smoothness properties of the underlying function being modeled 
that can blur key elements of the signal if abrupt changes occur. The Matern kernel enables 
less smooth functions, but assumes a stationary process that does not adapt to varying levels 
of smoothness. Likewise, a changepoint [23] or partition [9] model between smooth functions 
fails to capture long range dependencies spanning changepoint s. 

Another long- memory process is the fractional ARIMA process [5], with extensions to infi- 
nite variance innovations in [15]; however, the appropriateness and robustness of such models for 
real data analysis has been questioned [8]. Wavelet methods have also been proposed, including 
recently for smooth functions with discontinuities [2]. Such methods inherit the properties and 
limitations of wavelet analysis, for example, lack of shift invariance. We take a fundamentally 
different approach based on GPs that allows (i) direct interpretability, (ii) local stationarity, 
(iii) irregular grids of observations, and (iv) sharing information across related time series. 

As a motivating application, consider Magnetoencephalography (MEG) recordings of brain 
activity in response to some word stimulus. Due to the low signal-to-noise-ratio (SNR) regime, 
multiple trials are often recorded, presenting a functional data analysis scenario. Each trial 
results in a noisy trajectory with key discontinuities (e.g., after stimulus onset). Although 
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Fi gure li For sensor 1 and word house, Left: Data from 
three trials; Middle: Correlation matrix estimated from 
20 trials; Right: Hierarchical segmentation produced by 
recursive minimization of normalized cut objective, with 
color indicating tree level. 

there are overall similarities between the replicates, there are also key differences that occur 
based on various physiological phenomena, as depicted in Fig. 1. We clearly see abrupt changes 
as well as long-range correlations. Key to the data analysis is the ability to share information 
about the overall trajectory between the single trials without forcing unrealistic smoothness 
assumptions on the single trials themselves. 

In order to capture both long-range dependencies and potential discontinuities, we propose a 
multiresolution GP (mGP) that hierarchically couples a collection of smooth GPs, each defined 
over an element of a nested partition set. Due to the inherent conjugacy of the GPs, one 
can analytically marginalize the GPs and compute the marginal likelihood of the observations 
conditioned on the partition tree. This conditional likelihood is equivalent to the likelihood 
under a GP with a partition-dependent covariance function. The covariance function encodes 
local smoothness of the function and enables discontinuities at the partition points. The amount 
of correlation between any two observations i/i and yj generated by the mGP at locations Xi 
and Xj is a function of the distance — and which tree levels contain both Xi and Xj. 
The tree thus encodes non-stationarity. 

The fact that there is an analytic form for the conditional distribution of the observations 
given the partition allows for efficient importance sampling of the partition itself. For our 
proposal distribution, we borrow the idea of normalized cuts [24] from the theoretical computer 
science and computer vision communities. We also present an MCMC sampler incorporating 
both local and global moves. Our inferences integrate over the partition tree, allowing blurring 
of discontinuities and producing functions which can appear smooth when discontinuities are 
not present in the data. 

2 Background 

A GP provides a distribution on real-valued functions / : A' ^ 5ft, with the property that 
the function evaluated at any finite collection of points is jointly Gaussian. The GP, denoted 
GP(m,c), is uniquely defined by its mean function m and covariance function c. That is, 
/ ~ GP(m, c) if and only if for all n > 1 and xi, . . . , x^, (/(^i), • • • , /(^n)) ~ ^nl^, K)^ with 
/i = [m(xi), . . . , m(xn)] and [K]ij = c{xi^Xj). The properties (e.g., continuity, smoothness, 
periodicity, etc.) of functions drawn from a given GP are determined by the covariance function. 
The squared exponential kernel, c(x, x') = (iexp(— — x'||2), leads to smooth functions. Here, 
d is a scale hyperparameter and is the bandwidth determining the extent of the correlation 
in / over A*. See [21] for further details. 



Figure 2: mGP: Parent function is 
split by = {^15^2}- Recursing 
down the tree, each partition has a GP 
with mean given by its parent function 
restricted to that set. 



3 Multiresolution Gaussian Process Formulation 



Our interest is in modeling a function g that (i) is locally smooth, (ii) exhibits long-range 
correlations (i.e., coTT{g{x)^ g{x')) > for — relatively large), and (iii) has abrupt changes. 
We begin by modeling a single function, but with a specification that readily lends itself to 
modeling a collection of functions that share a similar global trajectory, as explored in Sec. 4. 

Generative Model Assume a set of noisy observations y = . . . , ?/^}, yi G 3?, of the 
function g at locations {xi, . . . , Xn}, G A' C J?^: 

yi=g{xi)^ei, ei^N{0,a^). (1) 

To capture both long-range dependencies and abrupt changes, we hierarchically define the 
function g as follows. Let A = {A^ , , . . . , A^~^} be a set of nested partitions of X with 
A^ = X and binary splits such that ^ = [JiAf and = A^i^i U ^2i- We define a global 

parent function on A^ as ^ GP(0,c^). Then, over each partition set Aj of the binary tree 
we define 

f{Ai)^GP{f-\Ai),ci). (2) 

That is, the mean of the GP is given by the parent function restricted to the current partition 
set. Finally, we set g = f^~^. A pictorial representation of the mCP is shown in Fig. 2. 

Covariance Function We assume a squared exponential kernel cj = dj ex.p{—K.j\\x — x^Wl)^ 
encouraging local smoothness over each partition set Aj. Allowing dj = enables unbalanced 
trees — the child function exactly equals the parent function. However, we focus on dj = d^ 
with < 1 for finite variance regardless of tree depth and additionally encouraging 

lower levels to vary less from their parent function, providing regularization and robustness to 
the choice of L. 

We typically assume bandwidths tzj = i^/\\Aj\\2 so that each child function is locally as 
smooth as its parent. One can think of this formulation as akin to a fractal process: zooming 
in on any partition, the locally defined function has the same smoothness as that of its parent 
over the larger partition. Thus, lower levels encode finer-resolution details. We denote the 
covariance hyperparameters as 6> = {d^, . . . , d^"^, k.}, and omit the dependency in conditional 
distributions for notational simplicity. 

Induced Marginal GP The conditional independencies of our mGP imply that 

Pis \A)= f Piniipif I f-\A')dr-'^-'. (3) 

Due to the inherent conjugacy of the GPs, one can analytically marginalize the hierarchy of 
GPs conditioned on the partition tree A yielding 

L-1 

g\A^GP{0,c*^), c*_^ = J2J2'^^Ai- (4) 

Here, = 1 if x, G Aj and otherwise. Eq. (4) provides an interpretation of the 

mGP as a (marginally) partition-dependent GP, where the partition A defines the disconti- 
nuities in the covariance function c^. The covariance function encodes local smoothness of g 
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and discontinuities at the partition points. Note that defines a non- stationary covariance 
function. 

The correlation between any two observations yi and yj at locations Xi and Xj generated 
as in Eq. (1) is a function of how many tree levels contain both Xi and xj and the distance 
\\xi — Xj\\. Let rf index the partition set such that Xi G A^^i and Lij be the lowest level for 

which Xi and Xj fall into the same set (i.e., the largest £ such that rf = rj). Then, for Xi ^ Xj 
corr(^i, yj \ A) = (5) 

_ Et:i)^^exp(-^||x.-x,||^/||^^^,||^) 

where the second equality follows from assuming the previously described kernels. An example 
correlation matrix is shown in Fig. 4(c). k determines the width of the bands while controls 
the contribution of level £. Since is square summable, lower levels are less infiuential. 



Marginal Conditional Likelihood Based on a vector of observations y = [yi ■ ■ ■ yn]' at 
locations x = [xi ■ ■ ■ Xn]\ we can restrict our attention to evaluating the GPs at x. Let /^(x) = 
[/^(^i) • • • f^{xn)y ' By definition of the GP, we have 

fi^)\f-H^),A'^N{f-\^),Ke), [i^d.i = | "'^'^^'''^■^ Xer^if ' 

where the covariance matrix Ki is a function of the level-specific partition A^. Likewise, ob- 
servations are generated as y | ^(x) ~ N{g{x.)^o-'^In). Recalling Eq. (3), standard results 
yield 

g{^)\A^NUY,Ki) ylA^NUa^I^^Y^xX (8) 

This result can also be derived from the induced mGP of Eq. (4) . Again note that the marginal 
conditional likelihood is equivalent to the marginal likelihood under a GP with a partition- 
dependent covariance matrix. Alternatively, one can condition on the GP at any level 

y I /'(x),^ ~ 7v(/(x),<72/„ + J2 (9) 

A key advantage of the mGP is the conditional conjugacy of the latent GPs that allows us to 
compute the likelihood of the data simply conditioned on the nested partition A. This fact is 
fundamental to the efficiency of the partition inference procedure described in Sec. 5. 



4 Multiple Trials 

In many apphcations, such as the motivating MEG application, one has a collection of ob- 
servations of an underlying signal. We refer to the replicates as trials. In order to capture 
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Figure 4: (a) Three trials and (b) all 100 trials of data generated from a 5-level mGP with a shared 
parent function /° and partition A (randomly sampled), (c) True correlation matrix, (d) Correlation 
matrix estimated from 100 trials, (e) Hierarchical segmentation produced by recursive minimization 
of normalized cut objective. 

the common global trajectory of these trials while still allowing for trial-specific variability, we 
model each trial as a realization from an mGP with a shared parent function One could 
trivially allow for alternative structures of hierarchical sharing beyond if an application war- 
ranted. For simplicity, and due to the motivating MEG application, we additionally assume 
shared changepoints between the trials, though this assumption can also be relaxed. 



Generative Model For each trial y^-^^ = . . . , ^n^}, we model 




with ^^-^^ = /^~^'(-^) generated from a trial-specific GP hierarchy 
f^'^^^ • • • ^ /^-I'O) with shared parent (Again, alter- 
native structures can be considered.) Equivalently, from Eq. (9) 

with e = 0, and independently for each j ^^g^^^ ^' Graphical 

model of J replicates from 



^ ^ ^ a shared global trajec- 



y(^) ir(x),^^7VfyO);/0(x),a2^ + ^i^A (11) tory f and hierarchical 
^ ^=1 ^ partition A. 

See Fig. 3 for a graphical model. Note that with our GP-based 

formulation, we need not assume coincident observation locations xi , . . . , Xn between the repli- 
cates. However, for simplicity of exposition, we consider shared locations. We compactly denote 
the covariance by E = a^/n + Y^fZi ^i- 

Simulated data generated from a 5-level mGP with shared and A are shown in Fig. 4. 
The sample correlation matrix is also shown. Compare with the MEG data of Fig. 1. Both the 
qualitative structure of the raw time series as well as blockiness of the correlation matrix have 
striking similarities. 

Posterior Global Trajectory and Predictions Based on a set of trials {y'^^^ . . . ,y*^^^}, 
it is of interest to infer the posterior of Standard Gaussian conjugacy results imply that 

p(/°(x) I y(i), . . . ,y(^),^) =n(^{K^' + JS-i)-' y, {K^' + JS"!)"' (12) 
where y = y*^*^. Likewise, the predictive distribution of data from a new trial is 

p(y(j+l) |y(l)^^ ,^^y(j)^^)^ j ^ j + l) ^ ^ ^^^^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^(.l ) ^ 

= n(^ (Xq-i + JS-i)-' y, S + (Xq-i + JS-i) ^ . (13) 



Marginal Conditional Likelihood Since the set of trials Y = {y , • • • , y ^'^^ } are generated 
from a shared parent function the marginal likelihood does not decompose over trials. 
Instead, 

(14) 

See the Appendix for a derivation. One can easily verify that the above simplifies to the 
marginal likelihood of Eq. (8) when J = 1. 

5 Inference of the Hierarchical Partition 

In the formulation so far, we have assumed that the hierarchical partition A is given. A key 
question is to infer the partition from the data. Assume that we have prior ^(.4) on the 
hierarchical partition. Based on the fact that we can analytically compute p{Y \ A)^ we can 
use importance sampling or independence chain Metropolis Hastings to draw samples from the 
posterior p{A \ Y). 

Partition Prior For the hierarchical partition, we consider a prior solely on the partition 
points rather than taking tree level into account as well. Recall that our partition trees hier- 
archically join neighboring contiguous regions in X. Because of our time-series analysis focus, 
we assume A' C 3?. We define a distribution F on X; the prior probability of A with partition 
points {zi, . . . , Z2L-i_i} is given by p{A) = Hi ^(^i)- Generatively, one can think of drawing 
2^~^ — 1 partition points and deterministically forming a balanced binary tree A from these. 
For multidimensional A', one could use Voronoi tessellation and graph matching to build the 
tree from the randomly selected Zi. Such a prior allows for trivial specification of a uniform 
distribution on A (simply taking F uniform on A') or for eliciting prior information on change- 
points, such as based on physiological information for the MEG data. In contrast, eliciting such 
information in a level-dependent setup is not straightforward. Also, despite common deploy- 
ment, a prior that takes the partition point at level £ as uniformly distributed over the parent 
set A^^~^ yields high mass on A with small Af. This property is undesirable because it leads to 
trees with highly unbalanced partitions. 

Our resulting inferences perform Bayesian model averaging over trees. As such, even though 
we specify a prior on partitions with 2^~^ — 1 changepoints, the resulting functions can appear 
to adaptively use fewer by averaging over the uncertainty in the discontinuity location. 

Partition Proposal Although stochastic tree search algorithms tend to be inefficient in gen- 
eral, we can harness the well-defined correlation structure associated with a given hierarchical 
partition to much more efficiently search the tree space. One can think of every observed loca- 
tion Xi as a node in a graph with edge weights between Xi and Xj defined by the magnitude of 
the correlation of i/i and yj. Based on this interpretation, the partition points of A correspond 
to graph cuts that bisect small edge weights, as graphically depicted in Fig. 5. As such, we seek 
a method for hierarchically cutting a graph. Given a cost matrix W with w{u^v) defined for 
all pairs of nodes ii, 'i; in a set V, the normalized cut metric [24] for partitioning V into disjoint 
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sets A and B is given by 




ncut(A, B) = cut(A, B) [assoc(A, V)~^ + assoc(5, V)~^] , (15) 

where cut{A, B) = ^ueA,veB^i^^^) assoc(A, V) = ^ueA,vev ^i^^^)- Typically, the cut 
point is selected as the minimum of the metric ncut(A, 5) computed over all possible subsets 
A and B. The normalized cut metric balances the connectivity measure between A and B with 
the connectivity between the elements in A (or B) and all elements, thus avoiding cuts that 
separate small sets. Fig. 1 shows an example of applying a greedy normalized cut algorithm to 
MEG data. 

Instead of deterministically selecting cut points, we em- 
ploy the normalized cuts objective as a proposal distribution. 
Let the cost matrix W be the absolute value of the empirical 
correlation matrix computed from replicates {y^^\ • • • , y^'^^} 
(see Fig. 1). Due to the natural ordering of our locations time 

Xi G A' C 3?, the algorithm is straightforwardly implemented. ^. r m ^ r ^ 

' ^ f . in • r An ■ Figure 5: Illustration of cut- 

We^step down the hierarchy, first proposing a cut of into ^^.^^^ ^^.^^^ contiguous seg- 

{^1,^2} with probability ^^^^g p^i^^g correla- 

11 111 tion. 

q{{A\,Al})^ncnt{A\,Air\ (16) 

At level each Aj is partitioned via a normalized cut proposal based on the submatrix of 
W corresponding to the locations Xi G Af. The probability of any partition A under the 
specified proposal distribution is simply computed as the product of the sequence of conditional 
probabilities of each cut. This procedure generates cut points only at the observed locations 
Xi. More formally, the partition point in A' is proposed as uniformly distributed between Xi 
and Xi-^i. Extensions to multidimensional A' rely on spectral clustering algorithms based on 
the graph Laplacian [31]. 

Markov Chain Monte Carlo An importance sampler draws hierarchical partitions A^^^ ~ 

with the proposal distribution q defined as above, and then weig hts the samples by p(^(^) ) /g(^(^) ) 
to obtain posterior draws [22]. Such an approach is naively parallelizable, and thus amenable 
to eflicient computations, though the effective sample size may be low if q does not adequately 
match the posterior p{A \ Y). Alternatively, a straightforward independence chain Metropolis 
Hastings algorithm (see Appendix) is defined by iteratively proposing A' ^ q which is accepted 
with probability min{r(^^ | ^4), 1} where ^ is a previous sample of a hierarchical partition and 

r{A' I A) = p{Y I A')p{A')q{A)/\p{Y \ A)p{A)q{A')]. (17) 

The tailoring of the proposal distribution q to this apphcation based on normalized cuts dra- 
matically aids in improving the acceptance rate relative to more naive tree proposals. However, 
the acceptance rate tends to decrease as higher posterior probability partitions A are discov- 
ered, especially for many level trees defined over large spaces X for which the search space is 
larger. 

One benefit of the MCMC approach over importance sampling is the ability to include more 
intricate tree proposals to increase efficiency. We choose to interleave both local and global tree 
proposals. At each iteration, we first randomly select a node in the tree (i.e., a partition set 
Aj) and then propose a new sequence of cuts for all children of this node. When the root node 
is selected, corresponding to A^^ the proposal is equivalent to the global proposals previously 
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considered. We adapt the proposal distribution for node selection to encourage more global 
searches at first and then shift towards a greater balance between local and global searches 
as the sampling progresses. Sequential Monte Carlo methods [4] can also be considered, with 
particles generated as global proposals. 

Computational Complexity The per iteration complexity is O(n^), equivalent to a typical 
likelihood evaluation under a GP prior. Using dynamic programming, the cost associated 
with the normalized cut proposal is 0{in?{L — 1)). Standard techniques for more efficient 
GP computations are readily applicable, as well as extensions that harness the additive block 
structure of the covariance. 

6 Related Work 

Various aspects of the mGP have similarities to other models proposed in the literature that 
primarily fall into two main categories: (i) GPs defined over a partitioned input space, and 
(ii) collections of GPs defined at tree nodes. The treed GP [9] captures non-stationarities 
by defining independent GPs at the leaves of a Bayesian CART-partitioned input space. The 
related approach of [14] assumes a Voronoi tessellation. These methods capture abrupt changes, 
but do not allow for long-range dependencies nor a functional data hierarchical structure, 
both inherent to our multiresolution perspective. Instead, a main motivation is the resulting 
computational speed-ups of an independently partitioned GP. A two-level hierarchical GP also 
aimed at computational efficiency is considered by [19]. This GP approximation takes the 
upper level to be a coarse-scale GP defined over the centroids of the k-means (or known) input 
partition, and the lower level to be GPs over each partition set with constant mean given by 
the parent GP at the given centroid. The partition model of [23] examines online inference of 
changepoints with GPs modeling the data within each segment. 

[12] consider covariance functions defined on a phylogenetic tree such that the covari- 
ance between function-valued traits depends on both their spatial distance and evolutionary 
time spanned via a common ancestor. [11] additionally model the phylogenetic tree via the 
coalescent-based Bayesian hierarchical clustering (BHC) of [29]. Here, the tree defines the 
strength and structure of sharing between a collection of functions rather than abrupt changes 
within the function. The Bayesian rose tree of [3] focuses on BHC with arbitrary branching 
structure. An application is considered where internal nodes are GPs and the data result from 
a mixture of GPs at the leaves. Such an approach is fundamentally different from the mGP: 
observations are not necessarily spatially clustered and each node in the tree defines a GP over 
the entire input space. Non-tree-structured mixtures of GP experts were considered in [17, 20]. 
Alternatively, multiscale processes have a long history (cf. [32]): the variables define a Markov 
process on a typically balanced, binary tree and higher-level nodes capture coarser level in- 
formation about the process. In contrast, the higher level nodes in the mGP share the same 
temporal resolution and only vary in smoothness. 

At a high level, the mGP differs from previous GP-based tree models in that the nodes of 
our tree represent GPs over a contiguous subset of the input space X (or induced Gaussian 
random vectors of varying dimension) constrained in a hierarchical fashion. Thus, the mGP 
combines ideas of GP-based tree models and GP-based partition models. 

One can formulate an mGP as an additive GP where each GP in the sum decomposes 
independently over the level-specific partition of the input space X. The additive GPs of [6] 
instead focus on coping with multivariate inputs, in a similar vain to hierarchical kernel learn- 
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Figure 6: For the data of Fig. 4, (a) true and (b) MAP partitions, (c) Trace plots of log likelihood 
versus MCMC iteration for 10 chains. Log likelihood under the true partition (cyan) and minimized 
normalized cut partition of Fig. 4 (magenta) are also shown, (d) Errors between posterior mean /° 
and true /° for GP, hGP, and mGP. (e) Predictive log likelihood of 10 heldout sequences for GP, hGP, 
and mGP with L = 2, 5(true),7, 10. 

ing [1]. Thus, additive CPs address an inherently different task. Another formulation related 
in title, but fundamentally different is the hierarchical GP latent variable model of [16]. The 
formulation takes latent variables at nodes of a fixed tree that are related via GP mappings. 

Finally, the hierarchical partitions are reminiscent of those associated with Polya trees, and 
clearly density estimation and regression are closely coupled tasks. However, the standard 
Polya tree assumes a fixed partitioning scheme. More recently, randomized and unbalanced 
partitions were considered in the optional Polya tree [33], but with a scheme tailored to the 
density estimation task. 



7 Results 

7.1 Synthetic Experiments 

To assess our ability to infer a hierarchical partition via the proposed MCMC sampler, we 
generated 100 replicates of length 200 from a 5-level mGP with a shared parent function 
The hyperparameters were set to = 0.1, k = 10, = dP exp(— 0.5(^ + 1)) for £ = 0, . . . , L — 2 
with = b. The data are shown in Fig. 4, along with the sample correlation matrix that is 
used as the cost matrix for the normalized cuts proposals. 

For inference, we set = (j^/3 and d^ = ((j^/3) exp(— 0.5^), where is the average 
time-specific sample variance, was as in the simulation. The hyperparameter mismatch 
demonstrates some robustness to mispecification. For a uniform prior p(A), 10 independent 
MCMC chains were run for 3000 iterations, thinned by 10. The first 1000 iterations used pure 
global tree searches; the sampler was then tempered to uniform node proposals. The effects of 
this choice are apparent in the likelihood plot of Fig. 6, which also displays the true hierarchical 
partition and MAP estimate. Compare to the normalized cut partition of Fig. 4, especially at 
the important level 1 cut. The full simulation study took less than 7 minutes to run on a single 
1.8 GHz Intel Core 17 processor. 

To assess sensitivity to the choice of L, we compare the predictive log-likelihood of 10 
heldout test sequences under an mGP with 2, 5, 7, and 10 levels. As shown in Fig. 6(e), there 
is a clear gain going from 2 to 5 levels. However, overestimating L has minimal influence on 
predictive likelihood since lower tree levels capture finer details and have less overall effect. We 
also compare to a single GP and a 2-level hierarchical GP (hGP) (see Sec. 7.2). For a direct 
comparison, both use squared exponential kernels. Hyperparameters were set as in the mGP 
for the top-level GP. The total variance was also matched with the GP taking this as noise 
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Figure 7: Per- lobe comparison of mGP to (a) GP and (b) hGP: For various values of r, % decrease 
in predictive MSE of heldout 2/?:r+3o conditioned on 2/i:t-i and 15 training sequences, (c) For a 
visual cortex sensor and word hammer, plots of test data, empirical mean (MLE), and hGP and mGP 
predictive mean for entire heldout y*. (d) Boxplots of predictive log likelihood of y* for the mGP and 
wavelet-based method of [18]. Plots aggregate results over 5 heldout sequences y* per word. 



and the hGP splitting between level 2 and noise. In addition to better predictive performance, 
Fig. 6(d) shows the mCP's improved estimation of 



7.2 MEG Analysis 

We analyzed magnetoencephalography (MEG) recordings of neuronal activity collected from a 
helmet with gradiometers distributed over 102 locations around the head^. The gradiometers 
measure the spatial gradient of the magnetic activity in Teslas per meter (T/m) [10]. Since 
the firings of neurons in the brain only induce a weak magnetic field outside of the skull, the 
signal-to- noise ratio of the MEG data is very low and typically multiple recordings, or trials^ 
of a given task are collected. Our MEG data was recorded while a subject viewed 20 stimuli 
describing concrete nouns (both the written noun and a representative line drawing), with 20 
interleaved trials per word. These concrete nouns fall into four categories: animals, buildings, 
food and tools. 

Efficient sharing of information between the single trials is important for tasks such as 
word classification [7]. A key insight of [7] was the importance of capturing the time- varying 
correlations between MEG sensors for performing classification. However, the formulation still 
necessitates a mean model. [7] propose a 2-level hierarchical GP (hGP): a parent GP captures 
the common global trajectory, as in the mGP, and each trial-specific GP is centered about 
the entire parent function^. This formulation maintains global smoothness at the individual 
trial level. The mGP instead models the trial-specific variability with a multi-level tree of GPs 
defined as deviations from the parent function over local partitions, allowing for abrupt changes 
relative to the smooth global trajectory. 

For our analyses, we only consider the building ("apartment", "barn", "church", "igloo", 
"house") and tool ("chisel", "hammer", "pliers", "saw", "screwdriver") words. Independently 
for each of the 10 words and 102 sensors, we trained a 5-level mGP using 15 randomly selected 
trials as training data and the 5 remaining for testing. Each trial was of length n = 340. We 
ran 3 independent chains of the MCMC sampler for 3000 iterations with both global and local 
tree searches. We discarded the first 1000 samples as burn-in and thinned by 10, resulting in 
600 hierarchical partition samples ^^^^ . The mGP hyperparameters were set exactly as in the 
simulated study of Sec. 7.1 for structure learning and then optimized over a grid to maximize 
the marginal likelihood of the training data. 

■"^The MEG machine has three sensors at each helmet position: two gradiometers and one magnetometer. 
For simplicity, we only consider one gradiometer per location. 

^The model of [7] uses an hGP in a latent space. The mGP could be similarly deployed. 
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We compare the predictive performance of the mGP 
in terms of MSE of heldout segments relative to a GP 
and hOP, each with similarly optimized hyperparam- 
eters. The predictive mean conditioned on data up 
to the heldout time is straightforwardly derived from 
Eq. (13). For the mGP, the calculation is averaged 
over the posterior hierarchical partition samples ^^^^ . 
Fig. 7 displays the MSEs decomposed by cortical re- 
gion. The results clearly indicate that the mGP con- 
sistently better captures the features of the data, and 
particularly for sensors with large abrupt changes such 
as in the visual cortex. The heldout replicates for a 
visual cortex sensor are displayed in Fig. 7(c). Rela- 
tive to the hGP, the mGP much better tracks the early 
jump in activity right after the visual stimulus onset 
{t = 0). The posterior distribution of inferred change- 
points at level 1, also broken down by cortical region, 
are displayed in Fig. 8. As expected, the visual cortex 
has the earliest changepoints. Similar trends are seen in the parietal lobe that handles per- 
ception and sensory integration. The temporal lobe, which is key in the semantic processing, 
has changepoints occurring later. These results concur with the findings of [25]: semantic 
processing starts between 250 and 600 ms and word length (a visual feature) is decoded most 
accurately very near the standard 100ms response time ("nlOO"). 

We also compare our predictive performance to that of the wavelet-based functional mixed 
model (wfmm) of [18]. The wfmm has become a standard approach for functional data analysis 
since it allows for spiky trajectories and efficient sharing of information between trials. One 
limitation, however, is the restriction to a regular grid of observations. Although the wfmm 
enables analysis in a multivariate setting by incorporating both fixed and random effects, for a 
direct comparison, we simply apply the wfmm to each word and sensor independently. Fig. 7(d) 
shows boxplots of the predictive heldout log likelihood of the test trials under the mGP and 
wfmm. In addition to the easier interpretability of the mGP, the predictive performance also 
exceeds that of the wfmm. 

8 Discussion 

The mGP provides a generative framework for characterizing the dependence structure of real 
data, such as the examined MEG recordings, capturing certain features more accurately than 
previous models. In particular, the mGP provides a hierarchical functional data analysis frame- 
work for modeling (i) strong, locally smooth sharing of information, (ii) global long-range cor- 
relations, and (iii) abrupt changes. The simplicity of the mGP formulation enables further 
theoretical analysis, for example, combining posterior consistency results from changepoint 
analysis with those for GPs. Although we focused on univariate time series analysis, our for- 
mulation is amenable to multivariate functional data analysis extensions: one can naturally 
accommodate hierarchical dependence structures through partial sharing of parents in the tree, 
or possibly via mGP factor models. Another interesting extension is to incorporate the mGP 
within a functional ANOVA framework [13]. 

There are many interesting questions relating to the proposed covariance function. Our 
fractal specification represents a particular choice to avoid over-parameterization, although 
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Figure 8: Inferred changepoints at 
level 1 aggregated over sensors within 
each lobe: visual (top-left), frontal (top- 
right), parietal (bottom-left), and tempo- 
ral (bottom-right). 
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alternatives could be considered. For hyperparameter inference, we anticipate that joint sam- 
pling with the partition would mix poorly, and consider it a topic for future exploration. We 
believe that the proposed mGP represents a powerful, broadly applicable new framework for 
non-stationary analyses, especially in a functional data analysis setting, and sets the foundation 
for many interesting possible extensions. 
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A Derivation of Marginal Conditional Likelihood 

We derive the marginal likelihood as follows. Throughout, we use /^(x) to compactly denote 
f^{xi:n)' As discussed in the paper, each trial can be described as an independent draw 

y(^") |/°(x),^~Arfy(^");/0(x),sY (18) 



where E = (T^/„ + X]i^=/ ^^'^ /°(^) ~ -^(Oj Kq). Therefore, the joint distribution of Y = 
{y^^\---,y^''^} is given by: 

piY I /0(x),^)p(/0(x)) = cfcaexp (^-i ^(y(') - /0(x))'E-i(y(^) - /°(x)) - ^f'K^'fi^)^ 
= cicexp (^-i^yW's-V« +/'e-i ^y« - \f'iJE-'+K^')f{^)^ 
= c(c2 exp l^-i ^yW's-V^') - ^(/°(x) - ^)'(JE-i + i^o"')(/°(x) - 0) + ^0'(JE-i + 



(19) 



where Ci = ((27r)"/2|S|i/2)-i, C2 = ((27r)"/2|ifo|'/')-\ and = (JE"! + K^')-^^-^ E.Y^'^- 
Then, p{Y \ A) = J p{Y \ f (x) , A)p{f {x))df° (x) is derived as 



p{Y I A) = cfc2exp l^-i^yW's-iyW + i<^'(JE 

I exp (-^(/"(x) - (/.)'(JE-i + i^o"')(/°W - 0)) df{x^..M) (20) 
= cfc2exp l^-i^yW'E-V(^) + l<P'{J^-'+K^')<^^ ■ (27r)»/2|JE-i 

(21) 

= (27r)-"-^/2|^or'/'isi-'/vs-' +-?fo"'r'/' 

exp (^-^^yW'E-V^^) + l<l^'iJ^-'+K-'U^ . (22) 
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B Details on MEG Experiments 



For the MEG experiments, the data from each word w and sensor p were treated independently. 
That is, each assumed a unique partition structure for the mGP. Additionahy, for ah models 
the hyperparameters were set in a training-data-driven fashion. 

B.l Hyperparameter Optimization 

The following describes how the hyperparameter optimization is performed for the MEG com- 
parisons. In all scenarios, the input space A' = [1 : 340] was first normalized to take values in 
[0:1], as was the case for the simulated study. 

Gaussian Process The GP was specified as follows. The covariance function was taken to 
be a squared exponential, which for word w and sensor p took the form Cw,p = d^^p ex.p{—hz\ \x — 
x^WD' The scale parameter was constrained to be a fixed linear function of the average time- 
specific sample variance ^ of the training data: dw,p = a^a^^. Likewise, the nugget noise 
was of the form ^ = f^d-'^^p- The parameters tz^ a^, and /3 were optimized on a grid to 
maximize the marginal likelihood of the training data over all words and sensors. 

Hierarchical GP For the 2-level hierarchical GP (hGP), a squared exponential kernel was 
also assumed for both levels. As in [7], a single bandwidth parameter was assumed. In par- 
ticular, for the shared top level GP, ^ = exp(— — a;'||2), and for the trial-specific 
level, cl^ p = dl^ p ex.-p{— i<i\\x — x'\\2). As in the GP, the hyperparameters were constrained as 
a function of a^^^: d^ ^ = a^a^ ,p, ,p = a-^a^ ,p, and a^^ ^ = Pd-'^^p- Here, k,, a^, a^, and /3 
were jointly optimized to maximize marginal likelihood. For numerical reasons, the minimum 
allowable nugget noise was set to 1% of a'^ p (i.e., f3 = 0.01). 

Multiresolution GP For the multiresolution GP (mGP), the covariance function was con- 
strained in a similar manner to the simulation study. In particular, a shared bandwidth k 
was assumed. The scale parameters {d^, d^, . . . , d^~^} were constrained as follows. The global 
parent GP was assigned scale d^ = a^a'^ p. The scale parameters of the L — 1 trial-specific 
levels of the mGP hierarchy were constrained by a fixed functional form as in the simulated 
data setup, determined by two parameters. In particular, d^ = [a^ exp(— p * i)]d-'^^p. Finally, 
the nugget noise followed ^ = f^o-'^^p- In this scenario, tz^ a^, a^, p, and /3 were jointly op- 
timized to maximize marginal likelihood based on initial samples of tree partitions ^^^^ using 
the hyperparameter settings of the simulated data example. Again, for numerical reasons, the 
minimum allowable nugget noise was set to 1% of d-'^^p. 
The optimized hyperparameter values were as follows: 









a' 




p 


GP 


1350 


0.15 




1 




hGP 


13000 


0.033 


1 


0.01 




mGP 


900 


0.1 


1.67 


0.01 


1.1 



For the hGP, a large bandwidth (low temporal correlation) is taken to account for the abrupt 
changes. Also note that the parent GP was given little variance and instead the variance was 
attributed to the lower level GP to account for the significant trial-to-trial variation. The GP 
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accounts for both the large trial-to-trial variability and the abrupt changes through a large 
nugget noise. The mGP variance dropped off fairly rapidly with tree level, as indicated by p. 
Note that both the hGP and mGP are able to account for trial-to-trial variability in the tree 
hierarchy instead of through the nugget noise, as indicated by low values of j3. 

In our optimization procedure, we found that the performance of the GP was the most 
sensitive to changes in the hyperparameter specification. Both the hGP and GP were fairly 
robust over a reasonably large range of settings (partially indicated by a fiat marginal likelihood 
of the training data over the range.) 

B.2 MEG Prior Settings 

The hierarchical partition prior ^(.4), determined by F on A* as described in Sec. 5, was set 
as follows for a given word w and sensor p. All of the training data associated with sensor p 
except for that of the considered word was used to produce a recursively minimized normalized 
cut partition for a 4- level tree. The associated strength of each cut (i.e., amount of empirical 
correlation cut) was also recorded. F was then defined as a kernel-smoothed version of the cut 
points and associated cut strengths, along with baseline mass at all points. This prior was used 
to mimic the information that might be garnered from a domain expert. Experiments were also 
run under a uniform prior and produced nearly identical results after burn-in. The aggregated 
posterior changepoints samples, depicted in Fig. 8 only for level 1, were clearly different from 
the prior setting, demonstrating learning of the partition points (not dominated by the prior 
setting). 

B.3 Additional Figures 

We provide some additional figures related to the MEG results presented in Fig. 7 of the main 
paper. 

In Fig. 9, we display examples of the heldout test data for the visual cortex sensor 77 and 

6 different words. Each plot only shows the first heldout trial even though the results of Fig. 

7 perform a full analysis on each of the 5 heldout trials. For r = 70, we show the predictive 
mean ?/*.^^3q conditioned on yl.^_i and 15 training sequences. We compare the performance of 
the mGP to that of the hGP. Only the predictive mean is displayed for clarity. The predictive 
variances associated with the hGP were similar to, but slightly larger than those of the mGP 
(7% larger on average). The 95% predictive intervals included the heldout observations in all 
6 cases for the mGP and in 5 cases for the hGP. However, note the significantly better mean 
predictions for the mGP. 

B.4 MEG Data Acquisition 

Subjects gave their written informed consent approved by University of Pittsburgh (protocol 
PRO09030355) and Carnegie Mellon (protocol HS09-343) Institutional Review Boards. MEG 
data were recorded using an Elekta Neuromag device (Elekta Oy). While the machine has 306 
sensors, to reduce the dimension of the data, only recordings from the second gradiometers 
were used for these experiments. The data was acquired at 1 kHz, high-pass filtered at 0.1 
Hz and low-pass filtered at 330 Hz. Eye movements (horizontal and vertical eye movements as 
well as blinks) were monitored by recording the differential activity of muscles above below and 
beside the eyes. At at the beginning of each session we recorded the position of the participants 
head with four head position indicator (HPI) coils placed on the subject's scalp. The HPI coils. 
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Figure 9: Heldout test data for the visual cortex sensor 77 and 6 different words. For r = 70, we show 
the predictive mean 2/?:t+3o under an hGP and mGP conditioned on 2/i:t-i and 15 training sequences. 

along with three cardinal points (nasian, left and right pre-auricular), were digitized into the 
system. 

The data were preprocessed using the Signal Space Separation method (SSS) [27, 28] and 
temporal extension of SSS (tSSS) [26] to remove artifacts and noise unrelated to brain activity. 
In addition, we used tSSS to realign the head position measured at the beginning of each block 
to a common location. The MEG signal was then low-pass filtered to 50 Hz to remove the 
contributions of line noise and down-sampled to 200 Hz. The Signal Space Projection method 
(SSP) [30] was then used to remove signal contamination by eye blinks or movements, as well 
as MEG sensor malfunctions or other artifacts. Each MEG repetition starts 260 ms before 
stimulus onset, and ends 1440 ms after stimulus onset, for a total of 1.7 seconds and 340 time 
points of data per sample. MEG recordings are known to drift with time, so we correct our 
data by subtracting the mean signal amplitude during the 200ms before stimulus onset, for each 
sensor /repetition pair. Because the magnitude of the MEG signal is very small, we multiply 
the signal by 10^^ to avoid numerical precision problems. 

C MCMC Sampler Pseudocode 

We assume (i) a cost matrix W formed from the absolute value of the empirical correlation 
matrix of a set of training replicates, (ii) a prior F on the partition points, and (iii) hyperpa- 
rameters = {k, dP , . . . , d^~^ ^ a'^} defining the mGP kernel bandwith, variances, and nugget 
noise, respectively. The sampler is initialized with a hierarchical partition A drawn from the 
normalized cut proposal q. The covariance matrix E is a deterministic function of the hier- 
archical partition A and the hyperparameters 0. In what follows, we use kernel to define 
a function that provides this mapping for a given partition set at a given tree level. The 
likelihood p{Y | E,6>) = p{Y \ A^6) is computed exactly as in Eq. (13) of the main paper. Al- 
gorithm 1 details the global search iterations and Algorithm 2 the local (which can also produce 
global searches if the root note is selected). The local search algorithm additionally assumes a 
node-proposal distribution indicated by nodeproposal. 
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Algorithm 1 One Iteration of mGP MCMC Sampler - GLOBAL SEARCH 
Input: Cost matrix W, input locations A', hyperparameters 6>, 

previous partition A and corresponding S 
{zi, . . . Z2L-i_i} ^ partition points of A 

A^ = X ^ Yj' = Onxn 

for^ = l,...,L-l do 
for = 1 : 2 : 2^ do 



initialize structures for proposal 

normalized cut proposal 

add Ki submatrix corresponding to A^ 
add submatrix corresponding to A^j^i 



_i} ^ partition points of Al 



p - Ber(min(r(^' | ^),1)) 

A^ pA! ^(\- p)A, 
Output: A^ S 



t[A! I A) 



p(y|s,i 
E ^ pE' + (1 - p)E 



accept or reject proposal 



Algorithm 2 One Iteration of mGP MCMC Sampler - LOCAL SEARCH 
Input: Cost matrix W, input locations A', hyperparameters 6>, 

previous partition A and corresponding E 
{zi, . . . Z2i.-i_i} ^ partition points of A 
E' ^ E, A! ^A 
A^% ^ nodeproposal 

S^{ii^,i)\A'J cAil} 
for {iy,£) eS do 

= S'(X') - kernel(X^^,^) 
for (z/, £) G S such that z^ is odd do 

{A'J,A%^}-q{-\A'^-+\y„W) 

E'iA'J) = E'iA'J) + kernel(^:^ 9, £) 
^'{AX,) = ^'{A',\,)+kernel{AX„e,£) 
{z[^ . . . Z2L-i_i} ^ partition points of A! 



initialize proposals to previous values 
select a set (tree node) to repartition 
node descendants 

remove contributions from node descendants 

normalized cut proposal 

add submatrix corresponding to A^ 
add Ki submatrix corresponding to A^j^i 



p - Ber(min(r(^' | A), 1)), r{A' \ A) 



p(y|s,( 



A^ pA' ^{1- p)A, 
Output: ^, E 



E ^ pE' + (1 - p)E 



accept or reject proposal 
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